ABSTRACT

In the final stages of drafting the 1974 revision of
the Standards for Educational and Psychological Tests, it became
clear to the committee members that the various drafts of the
revisions emphasized the assessment of individuals, rather than
program evaluation. Because of its concern, the committee decided to
finish the present draft and to recommend the preparation of a
companion volume of standards for program evaluation to the three
sponsoring bodies: the American Psychological Association, the
American Educational Research Association, and the National Council
on Measurement in Education. This brief report outlines the contents
of the memorandum proposing the companion volume, and describes the
initial actions taken by the three sponsoring associations, on the
basis of solicited evaluations. (Author/flf)
Background of the Project to Develop Guidelines
Standards for Educational Evaluation

George F. Madaus
Boston College

In the final stages of drafting the 1974 revision of the Standards for
Educational and Psychological Tests it became clear to the committee members that the
various drafts of the Standards revisions gave primary emphasis to standards that
dealt with the assessment of individuals. While acknowledging problems in the use
of tests in evaluating programs, curricula, therapeutic treatment, etc., the draft
in its overall content and tone clearly focused on decisions affecting an examinee
by an employer, personnel manager, guidance counselor, college admission officer,
vocational counselor, clinician, teacher, etc.

A perusal of the letterheads over which individuals submitted reactions to the
final draft of the revisions revealed that the bulk of the testimony came from
individuals representing private firms or associations with interest in employment
and personnel selection; from governmental agencies engaged in testing for
employment or monitoring of employment practices; from associations representing minority
groups; and from legal assistance groups with interest in unfair hiring practices.

Individuals associated with firms directly engaged in large scale program
evaluation and/or policy research either submitted no testimony at all, or in the
case of two firms, submitted comments which dealt with matters of individual
assessment or criterion referenced measurement. That the focus and flavor of the
revision, like that of the 1966 version, was primarily on the use of tests for the
assessment of individuals was of course not surprising. The growing number of
court cases involving the use of tests in civil service, private industry, and the
civil rights area helped to shape the tone and thrust of the draft.

Thus the committee became concerned that while the draft document did have
things to say to those using tests in program evaluation, the sense the document
conveyed through its choice of language and examples was primarily one of a set of
standards focused on the more traditional use of tests in individual assessment.
Because of this concern the committee deliberated on whether to begin a new draft
which would include additional standards for test use in the program evaluation
domain, or whether to finish the present draft and to recommend to the three
sponsoring bodies the preparation of a companion volume of standards for program
evaluations. To help in these deliberations I was asked as a member of the committee
to prepare a memorandum on the issues that might be addressed in a companion
volume. At that time there was a pressing need to issue a revision of the 1966
Standards that addressed the use of tests in selection and
employment. After further deliberation the joint committee decided to recommend
to APA, AERA and NCME that a companion volume be developed. The memorandum became
the basis of a proposal for a companion volume and was circulated by the Board of
Scientific Affairs of APA to approximately twenty individuals intimately concerned
with large scale evaluation and policy research for their reactions as to its
merits.

In the time remaining I will briefly outline the contents of the memorandum,
reactions to the proposal for a companion volume and the initial steps taken by
APA, AERA and NCME on the basis of the solicited evaluations.



In outlining the contents of the memorandum I must of necessity be very brief
and consequently I will for the most part omit specific examples used in the
memorandum to illustrate the need for a companion volume. However, those of you
wishing a copy of the memorandum can write to me at Boston College.

The 1974 Memorandum



After giving a brief history of the impetus behind the 1966 Standards and the
subsequent growth after 1966 of program evaluation, the memo asked whether the
proposed revision adequately dealt with issues related to test use in the following
situations:

(a) The evaluation of the effectiveness of government sponsored
    educational interventions such as Head Start, Titles I, III,
    and VII of the Elementary and Secondary Education Act (ESEA),
    Follow Through, METCO, etc.

(b) Educational research affecting public policy. This includes
    research similar to that reported by Coleman (1966), Jencks et al.
    (1972), Jensen (1969), Herrnstein (1971), Armor (1972), etc.

(c) The formative and summative evaluation of large scale
    curriculum development projects (e.g., BSCS; Harvard Project
    Physics; the Aesthetic Education Program) and other
    educational products and packages produced by the Regional
    Laboratories. Programs like Sesame Street, ZOOM, The Electric
    Company, etc., developed under governmental and foundation
    grants could also be included under this category.

In addition to these three categories there were other developments related to
the testing movement which were not addressed in the draft revision and which
called for a consideration of possible standards of test use. These include the
National Assessment Project, statewide needs assessments, performance contracting,
criterion referenced testing, and accountability.

(A) Program Evaluation



Under the category of program evaluation the following issues were raised
in the memorandum:

(1) The implications of using tests primarily designed to maximize
    individual differences for the evaluation of group performance.

(2) The implications of using norms based on individual performance
    for evaluating group performance.

(3) The proliferation of discourse on the properties of new tests
    (e.g., criterion referenced tests) cried out for definitive
    treatment; standards relevant to the development, use and
    interpretation of alternatives to norm referenced tests needed to
    be developed.

(4) "New" tests apart, current test interpretation in program
    evaluation could benefit from new standards. Evaluations of federally
    funded programs lean heavily on the measurement of educational
    progress or growth. The increasing use of standardized
    achievement tests in the measurement of growth posed special problems
    not covered in present standardized test manuals or in the draft
    revision of the Standards. The implications of using various
    derived test metrics such as the grade equivalent (GE) to
    operationalize "growth" needed a detailed explication. Questions
    associated with problems of analyses of gain scores, so crucial in
    longitudinal studies like Head Start and Follow Through, were not
    considered in the draft revision.

    It was argued that a new set of standards could bring together
    the best thinking on these important issues with a force and
    authority that would persuade companies performing evaluations,
    and test publishers in their manuals, to address themselves more
    carefully to problems of growth and gain.

(5) Another issue frequently encountered in program evaluation, not
    covered in the proposed draft, was that of regression artifacts.

(B) Use of Tests in Public Policy Research

It seemed clear that public policy debate is often influenced by
inferences drawn from test data. There is little doubt that studies by Jensen (1969),
Herrnstein (1971), Armor (1972), Jencks et al. (1972), Coleman et al. (1966) and the
reanalysis of the Coleman data by Mosteller and Moynihan (1972), have all
influenced the dialogue concerning educational policy, legislation and funding
priorities for educational research. The validity of the conclusions from all of these
studies rests primarily on inferences made from test data. A set of standards
specifically geared to the use of tests in policy related research could have
provided a framework within which a more rational debate of the merits of these studies
could have taken place, just as the 1966 Standards acted as a framework within
which the issue of test validity in discrimination cases was argued.



Coleman and Karweit (1972) offered three headings under which test results
have been used to measure school and program effectiveness in policy research. The
memorandum adopted these headings:

1. Depicting the level of functioning of students in an already existing
   program, school or school district.

   Witness the yearly publication in such papers as the New York Times
   or the Boston Globe of school average reading scores at different grade
   levels. The revision of the Standards did not adequately deal with this
   type of test reporting or the misleading inferences to which these
   seemingly straightforward data lend themselves.

2. Describing the impact of a special program with a definite starting point.

   The Wolff and Stein (1966) study of the impact of summer Head Start
   programs is a case in point under this heading.



3. Using test scores as "dependent variables" in research aimed at separating
   effects of student background from those of school environment.

   Coleman's final category is applicable to his work. The IEA study (Husen,
   1967) is another example in this category. There are many issues subsumed
   under this category. For example, in the Plowden Report, the question arises
   of how best to treat Time 1 achievement scores in an analysis which seeks to
   account for Time 2 achievement differences. Depending on the method used, the
   policy implementations differ considerably as has been outlined by Acland
   (1973).

Another issue involves which derived achievement score should be used in
analysis and how mean GE scores should be computed. For example, an evaluator can
compute the mean of a group of raw scores (or "standard scores") and convert that
mean to another score metric such as a GE (or "standard score", whatever the case
may be) for the group. Using the same set of scores, another evaluator could
convert each raw score to a GE and compute the resulting mean of the GE distribution.
The two mean GE's will not necessarily be the same.*

Since test validity refers to the accuracy of inferences from test scores, the
inferences made from a group's actual test performance (i.e., number right) should
not vary from one score to another.
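
The two ways of arriving at a group mean GE described above can be illustrated with a minimal sketch. The raw-score-to-GE conversion table and the four raw scores below are hypothetical, invented only to show the mechanism; they are not taken from the memorandum or from any published test manual, and the size of the resulting gap carries no meaning beyond the illustration. The only assumption that matters is that the conversion is not a straight line, which is typical of grade equivalent scales.

```python
from bisect import bisect_right
from statistics import mean

# Hypothetical raw-score -> grade-equivalent (GE) table, invented for
# illustration only. The spacing is deliberately nonlinear.
RAW_TO_GE = [(10, 1.0), (20, 2.0), (30, 2.5), (40, 3.5), (50, 6.0)]


def raw_to_ge(raw):
    """Convert a raw score to a GE by linear interpolation in RAW_TO_GE."""
    raws = [r for r, _ in RAW_TO_GE]
    if raw <= raws[0]:
        return RAW_TO_GE[0][1]
    if raw >= raws[-1]:
        return RAW_TO_GE[-1][1]
    i = bisect_right(raws, raw)  # index of the first table entry above raw
    (r0, g0), (r1, g1) = RAW_TO_GE[i - 1], RAW_TO_GE[i]
    return g0 + (g1 - g0) * (raw - r0) / (r1 - r0)


# Hypothetical raw scores for one group of four examinees.
raw_scores = [10, 20, 30, 50]

# Evaluator 1: average the raw scores, then convert that mean to a GE.
ge_of_mean = raw_to_ge(mean(raw_scores))

# Evaluator 2: convert each raw score to a GE, then average the GEs.
mean_of_ges = mean(raw_to_ge(r) for r in raw_scores)

print("GE of the mean raw score:", ge_of_mean)
print("Mean of the individual GEs:", mean_of_ges)
```

Because the conversion is nonlinear, the order of averaging and converting changes the answer. With a strictly linear table the two figures would coincide.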

Finally, there have been serious proposals to use test results to allocate
funds for programs for the educationally disadvantaged. The implications of using
various derived test scores in an allocation formula and as the basis for
continuation of funds have not been fully thought through. Standards could inform the
policy maker in this important and sensitive area.

(C) Evaluation of Large Scale Curriculum Development Projects 

The intent of the evaluation process in R&D is to assure quality control
for eventual products. However, the evaluation effort varies widely across
development projects. The proliferation of literature on what constitutes good
evaluation has not as yet been synthesized into a set of recognized standards. It was
felt that a set of standards suitably promulgated would force educational
entrepreneurs to reconsider their responsibilities for quality control.

The memo asked whether or not standards are needed (a) regarding the
assessment procedures used to evaluate the various components of the product
while under development (formative evaluation); (b) regarding the assessment
procedures used to evaluate the product's effectiveness upon completion and release
for sale (summative evaluation); (c) regarding the type and quality of the
information that the developer/publisher must provide the consumer (the standards
called for in this last instance are analogous to those governing test manuals); and
(d) regarding the procedures used to determine the educational as well as the cost
effectiveness of the particular package under evaluation vis-a-vis other packages
purporting to accomplish the same objectives. This type of comparative curriculum
evaluation encounters many of the difficulties discussed in section (A) above.

*The difference can range from one to four tenths of a G.E., based on simulated
data using the California Test Battery.
If evaluation is really an integral part of curriculum development, if its
function is to inform that development and protect the consumer, then a
consideration of standards relative to the evaluation process in R&D by APA, AERA, and NCME
appears to be in order.


(D) Other Issues

It was felt that several other developments in testing during the past
several years needed scrutiny in terms of standards. For these, no attempt was made to
develop the related measurement issues. Instead, the memo suggested that there
are important public policy issues relevant to test usage in the areas of statewide
assessment, accountability, performance contracting and the National Assessment
Project. A new committee would need to scrutinize these areas with an eye toward
suggesting standards where needed.

Review of the Memorandum 

In brief then, these were the issues described in the memorandum. The
evaluation of the memorandum was favorable. Most of the reviewers felt that there was
a need for standards or at least guidelines in the areas outlined. Further, it was
generally felt that the time was ripe to begin this undertaking. Many individuals
cited additional issues beyond test use that the new committee should also address.
These included:

--ethical/philosophical issues
--control of bias
--evaluation of evaluation
--essentials to be included in an evaluation report
--centrality of valuing process
--evaluator's relationship to programs evaluated, funders of evaluation,
  audiences, and general public
--analysis techniques
--standard related to tactics of consciously or unconsciously reducing
  the quality of pretests (test-administrator malingering)
--standard relating to preserving data for reanalysis (freedom of
  information) and protecting participant privacy
--experimental design considerations
--presentation of results
--ground rules for research
--responsibilities in serving different audiences
--experimental design and statistical control
--matrix sampling testing procedures
--legal questions
--minimum specifications for evaluation contracts
--conceptualization of evaluation
--state and federal regulations
--organization and management concerns

Further, the reviewers cited several excellent examples of work already under
way which would be relevant to the development of a new set of standards for
program evaluation. These included work by Tyler, Scriven, Messick, Stake, Lumsdaine,
Novick, Freeman, Bederman, Wholey, Eash, Stufflebeam, Coleman, Palmer, Sanders, and
ETS.



On the basis of these favorable reviews the three agencies which sponsored
the 1974 Revision of the Standards appointed a committee composed of Egon Guba of
AERA; Don Campbell, Robert Senn and Henry Riecken of APA; and Ron Carver, Dan
Stufflebeam and myself from NCME to decide upon the next steps to be taken in the
development of a companion volume. That committee met in Chicago in May of 19
and I think this point is best taken up by our next speaker, Dan Stufflebeam.

References 

Acland, H.D. Social determinants of educational achievement: an evaluation and
    criticism of research. Ph.D. thesis submitted to Oxford University, 1973.

Armor, D.J. The evidence on busing. The Public Interest, No. 28 (Summer 1972),
    pp. 90-126.

Coleman, J.S., et al. Equality of educational opportunity. Washington, D.C.:
    U.S. Government Printing Office, 1966.

Coleman, J.S. and Karweit, N. Information systems and performance measures in
    schools. Englewood Cliffs, N.J.: Educational Technology Publishers, 1972.

Herrnstein, R. IQ. The Atlantic Monthly, Sept. 1971, 43-64.

Husen, T. International study of achievement in mathematics. New York: John Wiley
    and Sons, 1967.

Jencks, C. et al. Inequality: a reassessment of the effect of family and schooling
    in America. New York: Basic Books, 1972.

Jensen, A.R. How much can we boost IQ and scholastic achievement? Harvard
    Educational Review, 39, 1969, 1-23.

Wolff, M. and Stein, A. A comparison of children who had Head Start, Summer 1965,
    with their classmates in kindergarten. New York: Ferkauf Graduate School of
    Education, Yeshiva University, 1966 (mimeo).

Mosteller, F. and Moynihan, D.P. (Eds.) On equality of educational opportunity.
    New York: Random House, 1972.






